ABSTRACT

The availability of real-time continuous speech recognition on
mobile and embedded devices has opened up a wide range of research
opportunities in human-computer interactive applications.
Unfortunately, most of the work in this area to date has been
confined to proprietary software, or has focused on limited domains
with constrained grammars. In this paper, we present a preliminary
case study on the porting and optimization of CMU SPHINX-II, a
popular open source large vocabulary continuous speech recognition
(LVCSR) system, to hand-held devices. The resulting system
operates at an average of 0.87 times real-time on a 206MHz device,
8.03 times faster than the baseline system. To our knowledge, this
is the first hand-held LVCSR system available under an open-source
license.

1.  INTRODUCTION 

Mobile, embedded, and hands-free speech applications fundamentally
require continuous, real-time speech recognition. Consider, for
example, an intelligent, interactive personal information assistant
in which natural speech has replaced the cumbersome stylus input and
cramped graphical user interface of a PDA. Many current applications,
such as speech control of GPS navigation systems and speech-controlled
song selection for portable music players and car stereos, also require
a reliable and flexible speech interface. Finally, sophisticated natural
language applications such as handheld speech-to-speech translation [1]
require fast and lightweight speech recognition.

Several technical challenges have hindered the deployment of
such applications on embedded devices. The most difficult of these
is the computational requirements of continuous speech recognition
for a medium to large vocabulary scenario. The need to minimize the
size and power consumption for these devices leads to compromises
in their hardware and operating system software that further restrict
their capabilities below what one might assume from their raw CPU
speed. For example, embedded CPUs typically lack hardware support
for floating-point arithmetic. Moreover, memory, storage capacity
and bandwidth on embedded devices are also very limited. For
these reasons, much of past work (e.g. [2], [3]) has concentrated on
simple tasks with restrictive grammars.

In addition to hardware limitations, interested developers face
a high barrier in building such systems. Doing so requires access to
proprietary speech recognition toolkits, which are often expensive and
usually provided without source code. In addition, popular embedded
operating systems may lack many of the features developers take for
granted on modern desktop systems, most notably a complete standard
C/C++ programming library and a fast virtual memory subsystem.

POCKETSPHINX  is  the  authors’  attempt  to  address  the  above 
issues.  Our  work  builds  on  previous  research  in  the  Carnegie  Mellon 
Speech  group  related  to  fast  search  techniques  ([4]  and  [5])  and  fast
GMM  computation  techniques  ([6],  [7]  and  [8]).  We  believe  that  this 
work  will  benefit  the  development  community  and  lead  to  the  easier 
creation  of  interesting  speech  applications.  Therefore,  we  have  made 
this  work  available  to  the  public  without  cost  under  an  open-source 
license.  To  the  best  of  our  knowledge,  this  is  the  first  open-source 
embedded  speech  recognition  system  that  is  capable  of  real-time, 
medium-vocabulary  continuous  speech  recognition. 

2.  BASELINE  SPHINX-II  SYSTEM 

The target hardware platform for this work was the Sharp Zaurus
SL-5500 hand-held computer. The Zaurus is typical of the previous
generation of hand-held PCs, having a 206MHz StrongARM® processor,
64MB of SDRAM, 16MB of flash memory, and a quarter-VGA
color LCD screen. We chose this particular device because it runs
the GNU/Linux® operating system, simplifying the initial port of
our system. However, the CPU speed and memory capacity of this
device are several years behind the current state of the art, making
it commensurately more difficult to achieve the desired level of
performance. To build our system, we used a GCC 3.4.1 cross-compiler
built with the crosstool script¹.

Platform speed directly affected our choice of a speech recognition
system for our work. Though all the members of the SPHINX
recognizer family have well-developed programming interfaces, and
are actively used by researchers in fields such as spoken dialog
systems and computer-assisted learning, we chose the SPHINX-II
recognizer² as our baseline system because it is faster than other
recognizers currently available in the SPHINX family.

To evaluate our system’s performance, we used 400 utterances
randomly selected from the evaluation portion of the DARPA
Resource Management (RM-1) corpus. The acoustic model uses Hidden
Markov Models with a 5-state Bakis topology and semi-continuous
output probabilities. It was trained from 1600 utterances from the
RM-1 speaker-independent training corpus, using 256 tied Gaussian
densities, 1245 tied Gaussian Mixture Models (senones), and 39570
context-dependent triphones. The input features consisted of four
independent streams of MFCC features, delta and delta-delta MFCCs,
and power. A bigram statistical language model was used with a
vocabulary of 994 words and a language weight of 9.5. The test set
perplexity of this language model is 50.86.

¹ http://kegel.com/crosstool/

² http://www.cmusphinx.org/

On our development workstation, a 3GHz Intel® Pentium® 4
running GNU/Linux, SPHINX-II runs this task in 0.06 xRT (times
real-time). After the first stage of porting the system to the Zaurus
target platform, without applying any optimizations, this same task
takes 7.15 xRT, with a baseline word error rate of 9.73%. Clearly, a
speed of 7.15 xRT is much too slow to be useful for even the simplest
recognition tasks.

3.  PLATFORM  OPTIMIZATIONS 

In  the  next  stage  of  development,  we  investigated  potential  speed-ups 
based  on  our  knowledge  of  the  hardware  platform.  First,  we  noted 
that  for  embedded  devices,  memory  access  is  slow  and  RAM  is  at 
a  premium.  We  made  several  changes  to  the  system  to  address  this 
problem,  described  in  Section  3.1.  Second,  the  data  representation 
was  not  optimal  for  the  capabilities  of  the  target  CPU.  Lastly,  some 
implementation  details  led  to  inefficient  code  being  generated  for 
the  target  platform.  The  changes  we  made  to  address  these  issues 
are  described  in  Section  3.2. 

3.1.  Memory  Optimizations 

Memory-mapped file I/O: For embedded devices, where RAM is a
scarce resource, acoustic model data should be marked as read-only
so that it can be read directly from ROM. On embedded operating
systems, the ROM is usually structured as a filesystem, and thus it
can be accessed directly by using memory-mapped file I/O functions
such as mmap(2) on Unix or MapViewOfFile() on Windows.
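
As a minimal sketch of the Unix case, the following maps a model file read-only with POSIX mmap(2). The function names map_model_readonly and unmap_model are hypothetical and are not taken from the SPHINX-II or PocketSphinx sources; error handling is reduced to returning NULL.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a whole file read-only. Returns NULL on failure.
 * (Hypothetical helper, for illustration only.) */
const void *map_model_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) != 0 || st.st_size <= 0) {
        close(fd);
        return NULL;
    }

    /* PROT_READ only: pages can be shared with the file cache (or ROM)
     * and never need to be written back or copied. */
    void *base = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd); /* the mapping remains valid after the descriptor is closed */
    if (base == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return base;
}

/* Release a mapping obtained from map_model_readonly(). */
void unmap_model(const void *base, size_t len)
{
    munmap((void *)base, len);
}
```

A caller would keep the returned pointer for the lifetime of the decoder and index into it directly, rather than copying the model data into heap memory.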

Byte ordering: Unfortunately, the original binary formats for
SPHINX-II acoustic and language models were not designed with
read-only access in mind. In particular, they used a canonical
byte-order that requires them to be read into memory and then
byte-swapped. We modified the HMM trainer, SphinxTrain, to output
these files in the target system's native byte-order. We then
modified SPHINX-II to use existing header fields to determine the
byte-ordering of the file, thus allowing memory-mapping for files in the
native byte order.
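
The header check can be sketched as follows. The magic value MODEL_MAGIC and the function header_byte_order are assumptions made for illustration; the paper does not say which header fields are inspected.

```c
#include <stdint.h>

/* Hypothetical magic number, written by the trainer in its own byte order. */
#define MODEL_MAGIC 0x11223344u

/* Reverse the byte order of a 32-bit word. */
static inline uint32_t swap32(uint32_t v)
{
    return (v >> 24) | ((v >> 8) & 0x0000ff00u) |
           ((v << 8) & 0x00ff0000u) | (v << 24);
}

/* Classify a file from the magic word as read on this machine:
 *   1  native byte order, data can be used in place (e.g. memory-mapped)
 *   0  opposite byte order, data must be copied and swapped
 *  -1  not a recognized file */
int header_byte_order(uint32_t magic_as_read)
{
    if (magic_as_read == MODEL_MAGIC)
        return 1;
    if (swap32(magic_as_read) == MODEL_MAGIC)
        return 0;
    return -1;
}
```

Only the first case permits the read-only mapping described above; the second falls back to loading and swapping in RAM.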

Data alignment: Modern CPUs either require or strongly
encourage aligned data access. For example, a 32-bit data field is
required to be aligned on a 4-byte address boundary. Where the model
file formats mixed data fields of different widths, it was necessary to
insert padding to ensure proper alignment. The result is that, while
our version can read model files from previous versions, the files
generated by it are not backward-compatible.
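
The padding computation itself is small. The helper below is a generic sketch (the name align_up is ours, not the file writer's): it rounds a file offset up to the next multiple of a power-of-two alignment, so a 32-bit field that would otherwise start at offset 5 is moved to offset 8.

```c
#include <stddef.h>

/* Round offset up to the next multiple of align.
 * align must be a power of two (4 for 32-bit fields). */
static inline size_t align_up(size_t offset, size_t align)
{
    return (offset + align - 1) & ~(align - 1);
}
```

The number of padding bytes to emit before a field is then align_up(offset, 4) - offset.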

Efficient representation of Triphone-senone mapping: Generally
in SPHINX-II these mappings are read from two large text files, and
stored in equally large hash tables. By contrast, the SPHINX-III
system uses a single, compact "model definition" file which is
represented by a tree structure in memory, a much more memory-efficient
solution. Therefore, we back-ported the model definition code from
SPHINX-III to our system, producing a significant reduction in
memory consumption and a much faster startup time.


3.2.  Machine-Level  Optimizations 

Use of Fixed Point Arithmetic: The StrongARM processor has no
hardware support for floating-point operations. Floating-point
computations must be emulated in software, usually by a set of math
routines provided by the compiler or by the runtime library. Since these
routines must exactly replicate the functionality of a floating-point
coprocessor, they are too slow for arithmetic-intensive tasks such as
acoustic feature extraction and Gaussian computation. Therefore,
we found it necessary to rewrite all time-critical computation using
integer data types exclusively.

Two basic techniques for doing this exist: either values can be
kept pre-scaled by a given factor (usually a power of two, for the
best performance) or they can be converted to logarithms (usually
with a base very close to 1.0, for the best accuracy). The choice
depends primarily on the dynamic range of the values in question and
on the types of operations that will be performed on them. In our
system, we calculate the Fast Fourier Transform (FFT) using signed
32-bit integers with a radix point at bit 16, that is, in Q15.16 format.
However, to calculate the MFCC features, we need to take the power
spectrum, whose dynamic range far exceeds the limits of this format.
Since we will eventually take the log of the spectrum in order
to compute the cepstrum, we use a logarithmic representation for the
power spectrum and the Mel-filter bank. Addition of logarithms is
accomplished using a lookup table, shared with the GMM computation
component, which also operates on integer values in log-space.
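
The table-based addition of logarithms can be written out as follows. This is the standard identity rather than a formula quoted from the system; the symbols b (the log base), L_x and L_y (integer log values) and T (the lookup table) are names assumed here for illustration.

```latex
\begin{align*}
L_x &= \operatorname{round}(\log_b x), \qquad
L_y = \operatorname{round}(\log_b y), \qquad L_x \ge L_y, \\
\log_b(x + y) &= \log_b x + \log_b\!\left(1 + b^{-(\log_b x - \log_b y)}\right)
  \;\approx\; L_x + T[L_x - L_y], \\
T[d] &= \operatorname{round}\!\left(\log_b\!\left(1 + b^{-d}\right)\right),
  \qquad d = 0, 1, 2, \ldots
\end{align*}
```

Because T[d] decreases toward zero as d grows, the table is finite: entries beyond the first d that rounds to zero can be omitted, and the sum is then simply the larger of the two operands.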

The use of fixed-point arithmetic inevitably involves some rounding
error, which is compounded by each operation performed. It is
therefore important to choose algorithms that minimize the number
of operations, not only for speed, but also to maintain accuracy. For
example, one way to optimize an FFT for real-valued input data is
to perform a half-length complex FFT on the input data, then
post-process the output to separate the real and imaginary parts [9].
However, when this is done in fixed-point, the added processing leads
to errors that can significantly increase the word error rate, in some
cases by up to 20% relative.

Optimization of data and control structures: The ARM architecture
is heavily optimized for integer and Boolean computation.
Most instructions include a “shift count” field that allows the
output operand to be bit-shifted by an immediate value without penalty.
In addition, most instructions can be nullified on any condition,
allowing many short branches to be eliminated. Finally, the ARM
is a 32-bit load-store architecture with 16 general-purpose registers.
Three consequences follow: it is important to keep data in registers
while performing intensive computations; it is always faster to access
memory 32 bits at a time; and unaligned accesses must be avoided at
all costs. In general, a good optimizing compiler can make efficient
use of the register file, but in some cases it is necessary to manually
unroll loops in order to generate the most efficient code.

As a case in point, a large percentage of CPU time in the SPHINX-II
system is spent in the maintenance of the list of active senones to
be computed for each frame of input. In the baseline system, this
list is generated by setting flags in an array of bytes which is then
scanned to produce an array of senone IDs. This arrangement is
sensible since there are typically many fewer active senones than
total senones. However, the byte-array representation places greater
load on the processor’s cache, and also involves byte-wide accesses
that are slow on CPUs such as ARM and PowerPC. Therefore, we
changed the representation to be a bit vector, and unrolled the loop
that scans this bit vector to operate on one 32-bit word at a time. This
also allows it to skip entire blocks of 32 senones in the case where
the active list is very sparse. In examining the generated assembly
code, we find that it is now very efficient: only 4 instructions are
used to check each senone in the bit vector and conditionally add it
to the list.
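
The bit-vector scan can be sketched in portable C as below. The function names are hypothetical, and the inner loop is left rolled for readability; the paper's version unrolls it by hand, and no claim is made that this sketch compiles to the same instruction count.

```c
#include <stdint.h>

/* Mark one senone as active in the bit vector. */
static inline void set_active(uint32_t *bits, int senone_id)
{
    bits[senone_id >> 5] |= (uint32_t)1 << (senone_id & 31);
}

/* Scan the bit vector one 32-bit word at a time and write the IDs of
 * all active senones to active_out. Returns the number written. */
int collect_active(const uint32_t *bits, int n_words, int *active_out)
{
    int n_active = 0;
    for (int w = 0; w < n_words; ++w) {
        uint32_t word = bits[w];
        if (word == 0)
            continue; /* 32 inactive senones skipped with a single test */
        for (int b = 0; b < 32; ++b) {
            if (word & ((uint32_t)1 << b))
                active_out[n_active++] = (w << 5) + b;
        }
    }
    return n_active;
}
```

Compared with one flag byte per senone, this representation uses one eighth of the memory for the flags and replaces byte-wide loads with word-wide ones.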

4.  ALGORITHMIC  OPTIMIZATIONS 

After completing the platform and software optimizations detailed in
the last section, we examined the output of the GNU gprof
source-level profiling tool to determine where to look for more principled
speed-ups. We found that the bulk of computation was spent in four
areas: acoustic feature (MFCC) calculation, Gaussian (codebook)
computation, Gaussian mixture model (senone) computation, and
HMM evaluation (Viterbi search)³. The approximate proportion of
time spent in these four areas is shown in Table 1.


Component    Desktop    Embedded
Codebook     27.43%     24.59%
HMM          24.68%     22.11%
MFCC         14.39%     11.51%
Senone        7.67%     11.71%

Table 1. Percentage of time spent in selected tasks

In  our  algorithmic  optimizations,  we  concentrated  primarily  on 
Gaussian  mixture  model  (GMM)  computation,  since  in  previous  work 
[6]  we  have  developed  a  well-reasoned  framework  for  approximate 
GMM  computation.  In  this  framework,  GMM  evaluation  is  divided 
into  4  layers  of  computation: 

1.  Frame  layer:  all  GMM  computation  for  an  input  frame. 

2.  GMM  layer:  computation  of  a  single  GMM. 

3.  Gaussian  layer:  computation  of  a  single  Gaussian. 

4.  Component  layer:  computation  related  to  one  component  in 
the  feature  vector  (assuming  a  diagonal  covariance  matrix). 

This framework allows a straightforward categorization of different
speed-up techniques by the layer(s) on which they operate,
and allows us to determine how different techniques can be applied
in combination with each other. However, this framework, as with
much other work in approximate GMM calculation, applies primarily
to systems using continuous distribution HMMs (CDHMM). In
application of the idea to semi-continuous HMMs (SCHMM), several
differences should be noted:

•  In  the  full  computation  of  semi-continuous  acoustic  models, 
a  single  “codebook”  of  Gaussian  densities  is  shared  between 
all  mixture  models. 

•  The  number  of  mixture  Gaussians  is  usually  128  to  2048, 
which  is  much  larger  than  the  16  to  32  densities  used  for  each 
mixture  in  a  typical  CDHMM  system. 

•  SCHMM-based  systems  usually  represent  the  feature  vector 
with  multiple  independent  streams. 

Therefore, though the four-layer model still applies, the layers
beneath the frame layer are structured differently. In particular, since
the codebook is shared between all GMMs, the entire codebook must
be computed at every frame. This limits the degree to which
approximations at the GMM layer can reduce computation. We applied the
following techniques to each layer:

³ In fact, 74% of the total running time is spent in only 13 functions!


•  Frame  layer:  We  applied  frame-based  downsampling  (see 
[10]).  Although  this  inevitably  results  in  a  loss  of  accuracy, 
it  is  the  only  way  we  found  to  achieve  a  speed-up  above  the 
Gaussian  layer. 

•  GMM layer: We attempted to apply context-independent
GMM-based GMM selection ([11], [7]). However, we found
that the overhead of this technique far outweighed the reduction
in GMM computation.

•  Gaussian  layer:  We  considered  several  possibilities  such  as 
Sub-VQ-based  Gaussian  selection[8],  but  all  of  these  involve 
significant  overhead.  Therefore,  we  decided  to  use  a  fast  tree- 
based  approach  to  Gaussian  selection. 

•  Component  layer:  SPHINX-II  already  implements  a  form  of 
partial  Gaussian  computation[12].  We  used  information  from 
the  tree-based  Gaussian  selection  to  improve  the  efficiency  of 
this  algorithm. 

The computation of the codebook is already relatively quick in
SPHINX-II. At each frame, the previous frame's top-N scoring
codewords (for some small N, typically 2 or 4) are recomputed and the
resulting distances are used as a threshold for partial computation
of the remaining codewords [12]. Computation of senones has also
already been optimized, by transposing the mixture weight arrays
in memory and quantizing them to 8-bit integer values [4], as well
as by providing separately optimized functions for each of the most
common top-N values.
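
The thresholded partial computation can be illustrated with the sketch below. It assumes a diagonal covariance and plain integer features; the function name, argument layout and scaling are our assumptions, not the SPHINX-II implementation.

```c
#include <stdint.h>

/* Accumulate the variance-weighted squared distance between feature
 * vector x and one codeword, one component at a time, and stop as soon
 * as the running sum exceeds threshold (for example, the worst distance
 * among the recomputed top-N codewords).
 * Returns 1 and stores the full distance in *dist_out if the codeword
 * survives, or 0 if it was pruned before all components were used. */
int partial_distance(const int32_t *x, const int32_t *mean,
                     const int32_t *inv_var, int n_dim,
                     int64_t threshold, int64_t *dist_out)
{
    int64_t dist = 0;
    for (int d = 0; d < n_dim; ++d) {
        int64_t diff = (int64_t)x[d] - mean[d];
        dist += diff * diff * inv_var[d];
        if (dist > threshold)
            return 0; /* cannot enter the top-N; skip remaining components */
    }
    *dist_out = dist;
    return 1;
}
```

Since every term of the sum is non-negative, a codeword whose partial sum already exceeds the threshold can never recover, so pruning it changes nothing about which codewords end up in the top-N.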

In the frame layer, we initially applied frame-based downsampling
in a straightforward manner, by simply skipping all codebook
and GMM computation at every other frame. However, we later
modified this to recompute the top-N Gaussians from the previous
frame and use these to compute the active senones from the current
frame. This is analogous to the “tightening” of the CI-GMM
selection beam that is implemented in SPHINX-III 0.6 [7]. We found
that this was actually faster by a very small margin (0.6%) and also
resulted in a 10% relative decrease in the word error rate.

In the Gaussian layer, we applied a modified version of the bucket
box intersection algorithm, as described in [13]. This algorithm
organizes the set of Gaussians in a kd-tree structure which allows a
fast search for the subset of Gaussians closest in the feature space to
a given feature vector. For each acoustic feature stream in the
codebook, we build a separate tree of arbitrary depth (typically depth 8
or 10, to reduce storage requirements) with a given relative Gaussian
box threshold. At each frame, after computing the previous frame’s
top-N codewords, we search the kd-tree to find a shortlist of
Gaussians to be partially computed.
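
The shortlist lookup amounts to a descent through a binary tree of axis-aligned splits. The sketch below uses an array-based node layout of our own design (the struct kd_node and function kd_lookup are hypothetical); the on-disk and in-memory formats used by the system are not described in the paper.

```c
#include <stddef.h>
#include <stdint.h>

/* One node of a kd-tree over a single feature stream (hypothetical layout). */
typedef struct kd_node {
    int split_dim;            /* feature component tested at this node */
    int32_t split_val;        /* threshold on that component */
    int left, right;          /* child indices into the node array, -1 at a leaf */
    const uint8_t *shortlist; /* codeword IDs whose boxes reach this leaf */
    int n_shortlist;          /* number of entries in shortlist */
} kd_node;

/* Descend from the root (nodes[0]) to the leaf whose region contains feat.
 * The returned leaf carries the shortlist of Gaussians to evaluate. */
const kd_node *kd_lookup(const kd_node *nodes, const int32_t *feat)
{
    const kd_node *node = &nodes[0];
    while (node->left >= 0) {
        node = (feat[node->split_dim] < node->split_val)
                   ? &nodes[node->left]
                   : &nodes[node->right];
    }
    return node;
}
```

Each step costs one comparison, so a tree of depth 8 or 10 adds only a handful of operations per stream per frame before the shortlisted Gaussians are evaluated.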

Though the trees are built off-line, the depth to search in the tree
can be controlled as a parameter to the decoder at run-time. This
allows the memory requirement for the trees to be quite small, since
the shortlists of Gaussians need only be stored at the leaf nodes. We
also explored the idea of limiting the maximum number of Gaussians
to be searched in each leaf node. In order to make this feasible,
we sorted the list of Gaussians in a leaf node by their “closeness” to
the k-dimensional region, or bucket box, delimited by that node. We
found that an appropriate criterion is the log-ratio of the total volume
of the individual Gaussian's bucket box to the volume of the region
in which it overlaps with the leaf node's bucket box.
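
One way to write this sorting criterion, assuming axis-aligned bucket boxes, is given below. The notation is ours rather than the paper's: B_g is the bucket box of Gaussian g, B_ℓ is the bucket box of the leaf node, and l_d and u_d are the lower and upper bounds of a box in dimension d.

```latex
\begin{align*}
\operatorname{vol}(B) &= \prod_{d=1}^{k} \bigl(u_d(B) - l_d(B)\bigr), \\
\operatorname{vol}(B_g \cap B_\ell) &= \prod_{d=1}^{k}
  \max\!\Bigl(0,\; \min\bigl(u_d(B_g), u_d(B_\ell)\bigr)
                 - \max\bigl(l_d(B_g), l_d(B_\ell)\bigr)\Bigr), \\
r(g, \ell) &= \log \frac{\operatorname{vol}(B_g)}{\operatorname{vol}(B_g \cap B_\ell)} \;\ge\; 0 .
\end{align*}
```

Under this reading, r is zero when the Gaussian's box lies entirely inside the leaf's box and grows as the overlap shrinks, so sorting by increasing r would place the closest Gaussians first. The sort direction is our inference from the description, not something the text states.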

5.  EXPERIMENTAL  RESULTS 

The result of applying various optimizations in sequence is shown in
Table 2. As expected, the largest speed gain we were able to achieve
came from the use of fixed-point arithmetic. Simply reimplementing
the existing acoustic feature extraction in fixed-point resulted in a
2.7-fold gain in speed. The loss of precision caused a slight increase
in the word error rate, from 9.73% to 10.06%.

System               xRT     Speed-up    WER       Δ
Baseline             7.15    0            9.73%    0
w/Fixed-point        2.68    2.67        10.06%    +3.4%
w/Real FFT           2.55    1.05        10.06%    0
w/Log Approx         2.29    1.11        10.75%    +6.9%
w/Assembly           1.60    1.43        10.69%    -0.6%
w/Top-2 Gaussians    1.40    1.14        11.57%    +8.2%
w/Viterbi-only       1.06    1.32        12.61%    +8.9%
w/Downsampling       1.00    1.06        13.29%    +5.4%
w/Beam-tuning        0.89    1.12        14.61%    +9.9%
w/kd-trees           0.87    1.02        13.95%    -4.5%

Table 2. Performance and accuracy on 994-word RM task after
successive optimizations
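
As a reading aid for Table 2, the Speed-up and Δ columns are consistent with each row being compared against the row directly above it. This is inferred from the numbers and is not stated in the text. For the fixed-point row:

```latex
\begin{align*}
\text{Speed-up} &= \frac{7.15\ \text{xRT}}{2.68\ \text{xRT}} \approx 2.67, \\
\Delta &= \frac{10.06\% - 9.73\%}{9.73\%} \approx +3.4\%\ \text{(relative)}.
\end{align*}
```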

The only algorithmic “free lunch” came from the use of an FFT
algorithm specialized for real-valued inputs [9]. After speeding up
the FFT, the fixed-point logarithm function used in MFCC calculation
became a bottleneck, so we reduced its precision, resulting in
a significant gain in speed, albeit with a reduction in accuracy. We
then reimplemented the fixed-point multiply operation using inline
ARM assembly language, giving another large boost in speed with
no degradation in accuracy.
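
For reference, the fixed-point multiply in the Q15.16 format of Section 3.2 can be sketched in portable C as below. The type and function names are ours, and the paper's inline assembly is not reproduced; on ARM, the 32x32-to-64-bit signed product used here corresponds to the SMULL instruction.

```c
#include <stdint.h>

#define Q_FRAC_BITS 16 /* radix point at bit 16 */

typedef int32_t q15_16;

/* Convert a small integer to Q15.16. */
static inline q15_16 q_from_int(int32_t v)
{
    return (q15_16)(v * (1 << Q_FRAC_BITS));
}

/* Multiply two Q15.16 values. The product is formed in 64 bits so no
 * high-order bits are lost, then shifted back to restore the radix point.
 * Right-shifting a negative value is implementation-defined in C; the
 * common compilers for ARM and x86 perform an arithmetic shift. */
static inline q15_16 q_mul(q15_16 a, q15_16 b)
{
    return (q15_16)(((int64_t)a * (int64_t)b) >> Q_FRAC_BITS);
}
```

A compiler that does not recognize this pattern may call a generic 64-bit multiply routine instead of emitting a single long-multiply instruction, which is one plausible reason a hand-written version would be faster.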

We changed several decoder parameters to their “faster” values
in order to boost the speed of the system. The number of top-N
Gaussians used to calculate the senone scores was reduced to 2
(and a loop in the top-2 senone computation function was unrolled).
SPHINX-II uses a multi-pass decoding strategy, performing a fast
forward Viterbi search using a lexical tree, followed by a flat-lexicon
search and a best-path search over the resulting word lattice. In order
to get the best performance, we disabled the latter two passes.

Next, we applied partial frame downsampling, then used a separate
held-out set of 200 utterances to find the optimal widths for
the various beams used in decoding, in order to reduce the amount
of time spent in Viterbi search. We then used the same held-out set
to find the optimal threshold and depth of kd-trees to use. To our
surprise, the use of kd-trees actually reduced the error rate slightly.

The final system has a word error rate of 13.95% on our test
set, degraded by 43.4% relative to the baseline system. We are
encouraged by the fact that the largest sources of degradation did not
come from our algorithmic optimizations, but rather from overly
zealous tuning of the search parameters. Such tuning could be relaxed
for more recent processors or can be adjusted for different tasks. It
is also likely that with better acoustic modeling and cross-validation,
these errors could be reduced or eliminated. In addition, this final
system exhibited an 8-fold reduction in CPU usage from the baseline
system and a 3-fold reduction from the baseline machine-optimized
system.

6.  CONCLUSION  AND  FUTURE  WORK 

In this paper, we present a 1000-word vocabulary system operating
at under 1 xRT on a 206 MHz hand-held device. The system in
question has been released as open source code and is available at
http://www.pocketsphinx.org/. PocketSphinx inherits
the easy-to-use API from SPHINX-II, and should be useful to
many other developers and researchers in the speech community.


In the future, we will apply this system to a task with a higher
perplexity language model and larger vocabulary. A candidate for
further optimization is the Viterbi search algorithm, which we have not
discussed in depth in this paper. Such a system will support
development of additional, more interesting applications. We are also
working on a port of PocketSphinx to the popular Windows® CE
operating system and Pocket PC hardware.